Compiling Deep Learning Models for Custom Hardware Accelerators

نویسندگان

Andre Xian Ming Chang

Aliasger Zaidy

Vinayak Gokhale

Eugenio Culurciello

چکیده

Convolutional neural networks (CNNs) are the core of most state-of-the-art deep learning algorithms specialized for object detection and classification. CNNs are both computationally complex and embarrassingly parallel. Two properties that leave room for potential software and hardware optimizations for embedded systems. Given a programmable hardware accelerator with a CNN oriented custom instructions set, the compiler’s task is to exploit the hardware’s full potential, while abiding with the hardware constraints and maintaining generality to run different CNN models with varying workload properties. Snowflake is an efficient and scalable hardware accelerator implemented on programmable logic devices. It implements a control pipeline for a custom instruction set. The goal of this paper is to present Snowflake’s compiler that generates machine level instructions from Torch7 model description files. The main software design points explored in this work are: model structure parsing, CNN workload breakdown, loop rearrangement for memory bandwidth optimizations and memory access balancing. The performance achieved by compiler generated instructions matches against hand optimized code for convolution layers. Generated instructions also efficiently execute AlexNet and ResNet18 inference on Snowflake. Snowflake with 256 processing units was synthesized on Xilinx’s Zynq XC7Z045 FPGA. At 250 MHz, AlexNet achieved in 93.6 frames/s and 1.2 GB/s of off-chip memory bandwidth, and 21.4 frames/s and 2.2 GB/s for ResNet18. Total on-chip power is 5 W.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Ristretto: Hardware-Oriented Approximation of Convolutional Neural Networks

Ristretto: Hardware-Oriented Approximation of Convolutional Neural Networks Convolutional neural networks (CNN) have achieved major breakthroughs in recent years. Their performance in computer vision have matched and in some areas even surpassed human capabilities. Deep neural networks can capture complex non-linear features; however this ability comes at the cost of high computational and memo...

متن کامل

Snowflake: A Model Agnostic Accelerator for Deep Convolutional Neural Networks

Deep convolutional neural networks (CNNs) are the deep learning model of choice for performing object detection, classification, semantic segmentation and natural language processing tasks. CNNs require billions of operations to process a frame. This computational complexity, combined with the inherent parallelism of the convolution operation make CNNs an excellent target for custom accelerator...

متن کامل

Synergy: A HW/SW Framework for High Throughput CNNs on Embedded Heterogeneous SoC

Convolutional Neural Networks (CNN) have been widely deployed in diverse application domains. There has been significant progress in accelerating both their training and inference using high-performance GPUs, FPGAs, and custom ASICs for datacenter-scale environments. The recent proliferation of mobile and IoT devices have necessitated real-time, energy-efficient deep neural network inference on...

متن کامل

Design Space Exploration of FPGA-Based Deep Convolutional Neural Networks

Deep Convolutional Neural Networks (DCNN) have proven to be very effective in many pattern recognition applications, such as image classification and speech recognition. Due to their computational complexity, DCNNs demand implementations that utilize custom hardware accelerators to meet performance and energy-efficiency constraints. Leverages all sources of parallelism in DCNNs, in this paper w...

متن کامل

A Code Optimization Framework for Performance Portability of GPU Kernels onto Custom Accelerators

The shift toward parallel computing has resulted into a growing interest in computing systems with heterogeneous processing modules. Reconfigurable devices are often employed in such heterogeneous systems due to their low power and parallel processing benefits. An important issue in the programmability of these systems is the need for a single programming interface. Recent works have leveraged ...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

CoRR

دوره abs/1708.00117 شماره

صفحات -

تاریخ انتشار 2017

Compiling Deep Learning Models for Custom Hardware Accelerators

نویسندگان

چکیده

منابع مشابه

Ristretto: Hardware-Oriented Approximation of Convolutional Neural Networks

Snowflake: A Model Agnostic Accelerator for Deep Convolutional Neural Networks

Synergy: A HW/SW Framework for High Throughput CNNs on Embedded Heterogeneous SoC

Design Space Exploration of FPGA-Based Deep Convolutional Neural Networks

A Code Optimization Framework for Performance Portability of GPU Kernels onto Custom Accelerators

عنوان ژورنال:

اشتراک گذاری